ABSTRACT

Big Data is a new technology and architecture that can work on a very
large volume and variety of data, enabling high-velocity capture,
discovery, and/or analysis. Big Data is about fast-growing sources of
data such as web logs, sensor networks, social media, Internet text and
documents, Internet pages, search index data, and scientific research.
Big Data also formally introduces a complex range of analysis and can
evaluate mixed data (structured and unstructured) from multiple sources.
Because some security issues in Big Data can no longer be solved by
applying hashing techniques to large amounts of data, this paper
presents a new approach to designing a Knox’ified Hadoop cluster.

Accepted Jul 25, 2018
1. INTRODUCTION

A wide variety of techniques and technologies has been developed for data capture, storage,
analysis, search, sharing, transfer, visualization, querying, updating, and information privacy.

This paper focuses on the Big Data security issues that arise in the base layer of the Hadoop
architecture [1], the Hadoop Distributed File System (HDFS), represented in Figure 1. The new
Hadoop security design relies on the use of Knox [2] and Ranger.
Figure 1. Overview of Hadoop Distributed File System


Journal homepage: http://iaescore.com/journals/index.php/ijeecs 




2. SECURITY ROADMAP 
Hadoop [1] supports only a few security features, using Kerberos, firewalls, ACLs, LDAP, etc.
Because installing a Hadoop cluster [1] and installing Kerberos are both difficult, securing
Hadoop remains a major problem today.
The security roadmap details the different technologies that have emerged alongside Hadoop;
they are represented in Table 1.
Table 1. Survey of a security and its solution in Big Data
Table 2 shows security in Hadoop today in terms of five security pillars: Administrator,
Authentication, Authorization, Audit, and Data Protection. The current solutions, some of which
are still under development, include Apache Knox, native Kerberos, audit, and encryption. Of
these solutions, Knox is described in the next section.
Table 2. Security in Hadoop today
S.NO. | SECURITY PILLARS | CURRENT SOLUTIONS
1. | Administrator: central management & consistent security | Apache Knox
2. | Authentication: authenticate users and systems | Apache Knox, native Kerberos
3. | Authorization: provision access to data | Apache Knox
4. | Audit: maintain a record of data access | Apache Knox, Hadoop native audit
5. | Data Protection: protect data at rest & in motion | HDFS transparent, HBase encryption, vendor solutions
3. KNOX 


Knox was developed by Hortonworks. It is a REST (Representational State Transfer, sometimes
spelled "ReST") API gateway for interacting with Hadoop services [11]. The Apache Knox Gateway
is a system that provides a single point of authentication and access for Apache Hadoop services in a
cluster [12]. Its aim is to simplify Hadoop security for both users and operators. The gateway runs as a
server (or cluster of servers) that provides centralized access to one or more Hadoop clusters, and it is
designed to hide the Hadoop cluster topology from the outside world. Plugins are available for Hadoop
services including WebHDFS, Oozie, Hive, HBase, and HCatalog. Knox [13] supports LDAP/Active
Directory integration, audits all Knox-managed gateway traffic, and provides service-level authorization
for Hadoop services. It offers end-to-end wire encryption via SSL: by default, Knox provides SSL
encryption from the client to the Knox gateway [14], and an SSL setup is also possible between Knox
and the Hadoop services [15].


3.1. Knox Architecture

This section shows the architecture of Knox, which consists of one or more servers that
sit outside the Hadoop cluster. Knox is designed to replace the SSH “edge node” for accessing Hadoop. It
provides a single port for accessing Hadoop [16] services, 8443 by default. It is designed to integrate with
Kerberos and LDAP (Lightweight Directory Access Protocol) to handle authentication services [25-27].
A Knox’ified Hadoop is shown in Figure 2.
Figure 3. Extend Hadoop API Reach with KNOX


3.2. Goals of Knox [12] 
Knox provides:
a) Perimeter security [17] for Hadoop REST APIs, making Hadoop security easier to set up and use, e.g.,
curl -i -k -u guest:guest-password -X GET 'https://localhost:8443/gateway/sandbox/webhdfs/v1/tmp/LICENSE?op=OPEN'.
b) Authentication [18] and token verification at the perimeter, by enabling authentication integration with
enterprise and cloud identity management systems.
c) Service-level authorization at the perimeter.
d) A single URL hierarchy that aggregates the REST APIs of a Hadoop cluster.
e) Secure extension of the reach of Hadoop [19] APIs to anyone on any device.
f) A gateway for Hadoop’s REST APIs, which offer varying levels of authentication,
authorization, SSL and SSO capabilities.
g) Protection against exposing the cluster ports and host names to all users.
New Apache Knox features in HDP 2.2:
a) Knox can be installed, started, stopped, and configured using Ambari.
b) It provides new support for the YARN REST API, HDFS HA, and SSL to Hadoop [20] cluster services
(WebHDFS, HBase, Hive, Oozie).
c) It has a Knox Management REST API.
d) It integrates with Apache Ranger for service-level authorization.


3.3. Knox REST Hierarchies

Knox provides a single REST hierarchy for all Hadoop services. Plain Hadoop [21] has different
hosts and different ports, and it exposes details of the cluster topology, viz.,
http://namenode:50070/webhdfs/, http://hivenode:10001/cliservice, and
http://localhost:11000/oozie. Knox, by contrast, has one host, one port, and a consistent structure, viz.,
https://knox:8443/webhdfs, https://knox:8443/hive, and https://knox:8443/oozie. Knox is only effective
with proper perimeter security [22] configured. A Knox’ified Hadoop cluster is shown in Figure 3.
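
The contrast between the two URL schemes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name to_gateway_url and the lookup table are hypothetical, the host and port pairs are taken from the example URLs in the text, and the gateway URLs follow the simplified form used in this section rather than any particular deployment.

```python
from urllib.parse import urlsplit

# Hypothetical lookup from internal (host, port) pairs to service names,
# taken from the example URLs in the text; a real cluster would differ.
INTERNAL_TO_SERVICE = {
    ("namenode", 50070): "webhdfs",
    ("hivenode", 10001): "hive",
    ("localhost", 11000): "oozie",
}


def to_gateway_url(internal_url: str, gateway: str = "https://knox:8443") -> str:
    """Map an internal service URL to the single-host, single-port gateway form."""
    parts = urlsplit(internal_url)
    # A KeyError here means the endpoint is not published through the gateway.
    service = INTERNAL_TO_SERVICE[(parts.hostname, parts.port)]
    return f"{gateway}/{service}"
```

Clients only ever see the gateway host and port; the internal host names and ports stay inside the lookup table, which is the point of hiding the cluster topology.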
Figure 4. Knox Architecture showing REST hierarchies


Steps to have Apache Knox up and running against a Hadoop cluster:
a) Verify the system requirements.
b) Download a virtual machine (VM) with Hadoop.
c) Download the Apache Knox gateway.
d) Start the virtual machine with Hadoop.
e) Install Knox.
f) Start the LDAP server embedded within Knox.
g) Start the Knox gateway.
h) Access Hadoop through Knox.
To get a file from HDFS via Knox, we use:
curl -i -k -u guest:guest-password -X GET 'https://localhost:8443/gateway/sandbox/webhdfs/v1/tmp/LICENSE?op=OPEN'
When the curl command is used, Kerberos [23] and LDAP services [24] can be integrated with Knox.
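
The same request can be prepared from a program. The following Python sketch uses only the standard library and is an assumption-laden illustration, not part of Knox: the helper names build_webhdfs_open and insecure_context are hypothetical, and the credentials and sandbox URL are the demo values from the curl command.

```python
import base64
import ssl
import urllib.request
from urllib.parse import urlencode


def build_webhdfs_open(base_url: str, path: str, user: str, password: str):
    """Prepare a GET request for a WebHDFS OPEN call routed through the gateway."""
    url = f"{base_url}/webhdfs/v1{path}?{urlencode({'op': 'OPEN'})}"
    # Equivalent of curl's -u option: HTTP Basic credentials in a header.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url, method="GET")
    request.add_header("Authorization", f"Basic {token}")
    return request


def insecure_context() -> ssl.SSLContext:
    """Equivalent of curl's -k option: skip certificate checks (test setups only)."""
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    return context


# Usage against a running sandbox (not executed here):
#   req = build_webhdfs_open("https://localhost:8443/gateway/sandbox",
#                            "/tmp/LICENSE", "guest", "guest-password")
#   with urllib.request.urlopen(req, context=insecure_context()) as resp:
#       print(resp.status)
```

Disabling certificate verification mirrors the -k flag and is reasonable only for a sandbox with a self-signed certificate; a production client would keep the default context.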




3.4. Knox Configuration Using Ambari 
In Ambari, click Add Service, select Knox, and click Next, as shown in
Figure 5.
Figure 5. Starting Knox Gateway
There will be a centralized master server, i.e., the Knox Gateway. Select it; more gateways can also
be selected if required using the drop-down list, as shown in Figure 6.
Figure 6. Assigning Knox Gateway


Now go to Customize Services, where the user has to enter a Knox Master Secret, as shown in
Figure 7.
Figure 7. Knox Master secret


In the Add Service Wizard, select Configure Identities, where we configure Knox by
selecting the Knox checkbox, as shown in Figure 8.
Figure 8. Knox service wizard
In this configuration window, select Proceed Anyway to deploy Knox, as shown in Figure 9.
Figure 9. Configuration of different services


Once the Deploy button is clicked, Knox is deployed, as shown in Figure 10.
Figure 10. Deploying the services


Once deployment starts, it takes some time to install all the services and their components on the
cluster, as shown in Figure 11.
Figure 11. Knox components


After deployment, all the services are configured using Ambari. The installation succeeds once
all the above steps are completed, as shown in Figure 12.
Figure 12. Installation success


3.5. Restarting Services in Knox Using Ambari

Unless all the services are restarted, the desired results cannot be obtained. If orange cycle icons
appear near the Services menu, all the affected services have to be restarted. To do that, click the orange
cycle icon -> Restart -> Restart All Affected -> Confirm Restart All, so that all the services are restarted.
Once Knox is installed, log in to the host where Knox has been set up. The shell prompt is shown, i.e.,
[ec2-user@ip-172-31-53-37 ~]$. Then use the following commands to see the defined properties for the
Knox Gateway.

$ cd /etc/knox/conf

$ ls -ltr

These commands list gateway-site.xml, as represented in Figures 13 and 14.

Knox topologies


[ec2-user@ip-172-31-53-37 ~]$ cd /etc/knox/conf
[ec2-user@ip-172-31-53-37 conf]$ ls -ltr
1 root root 1436 Jul 14 2015 shell-log4j.properties
1 root root 91 Jul 14 2015 README
1 root root 1485 Jul 14 2015 knoxcli-log4j.properties
1 knox knox 2355 Apr 28 21:55 gateway-log4j.properties
2 knox knox 4096 Apr 28 21:55 topologies
1 knox root 305 Apr 28 21:55 krb5JAASLogin.conf
1 knox knox 1718 Apr 28 21:55 ldap-log4j.properties
1 knox knox 2765 Apr 28 21:55 users.ldif
1 knox knox 865 May 12 19:41 gateway-site.xml
[ec2-user@ip-172-31-53-37 conf]$ view gateway-site.xml
[ec2-user@ip-172-31-53-37 conf]$ cd topologies/
[ec2-user@ip-172-31-53-37 topologies]$ ls -ltr
total 16
-rw-r--r-- 1 knox knox 89 Jul 14 2015 README
-rw-r--r-- 1 knox knox 4422 Jul 14 2015 admin.xml
-rw-r--r-- 1 knox knox 3011 May 11 20:20 default.xml
[ec2-user@ip-172-31-53-37 topologies]$


Figure 13. Knox configurations gateway-site.xml, admin.xml, default.xml


<!--Thu May 12 19:41:46 2016-->
<configuration>

<property>
<name>gateway.gateway.conf.dir</name>
<value>deployments</value>
</property>

<property>
<name>gateway.hadoop.kerberos.secured</name>
<value>false</value>
</property>

<property>
<name>gateway.path</name>
<value>gateway</value>
</property>

<property>
<name>gateway.port</name>
<value>8443</value>
</property>

<property>
"gateway-site.xml" [readonly] [noeol] 39L, 865C


Figure 14. Knox gateway-site.xml
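
The property list in gateway-site.xml can also be read programmatically rather than viewed in an editor. The Python sketch below is a minimal illustration: the function name read_properties is hypothetical, and the sample string contains only the four properties visible in Figure 14, closed off by hand so that it is well-formed (the real file has further entries that the screenshot cuts off).

```python
import xml.etree.ElementTree as ET

# Well-formed sample limited to the properties visible in Figure 14.
SAMPLE_GATEWAY_SITE = """<configuration>
  <property>
    <name>gateway.gateway.conf.dir</name>
    <value>deployments</value>
  </property>
  <property>
    <name>gateway.hadoop.kerberos.secured</name>
    <value>false</value>
  </property>
  <property>
    <name>gateway.path</name>
    <value>gateway</value>
  </property>
  <property>
    <name>gateway.port</name>
    <value>8443</value>
  </property>
</configuration>"""


def read_properties(xml_text: str) -> dict:
    """Return the name/value pairs of a Hadoop-style configuration document."""
    root = ET.fromstring(xml_text)
    return {
        prop.findtext("name"): prop.findtext("value")
        for prop in root.findall("property")
    }
```

With this sample, read_properties(SAMPLE_GATEWAY_SITE)["gateway.port"] yields the string "8443", which together with gateway.path gives the https://host:8443/gateway prefix used in the curl commands.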
The name parameter specifies the external host names in a comma separated list.
The value parameter specifies corresponding internal host names in a comma separated

Note that when you are using Sandbox, the external hostname needs to be localhost, as
out
of box sandbox.xml. This is because Sandbox uses port mapping to allow clients to ca
the
Hadoop services using localhost. In real clusters, external host names would almost
localhost.
-->
<provider>
<role>hostmap</role>
<name>static</name>
<enabled>true</enabled>
<param><name>localhost</name><value>sandbox,sandbox.hortonworks.com</value></para
</provider>

</gateway>
<service>
<role>KNOX</role>
</service>

</topology>


Figure 15. Knox-admin.xml 
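
The effect of the hostmap provider can be illustrated with a small Python sketch. It is not the Knox implementation: the function names are hypothetical, the assumption that every internal host name is rewritten to the first external name is ours, and the port 50075 in the usage note is only a sample value.

```python
from urllib.parse import urlsplit, urlunsplit


def outbound_hostmap(param_name: str, param_value: str) -> dict:
    """Build an internal-to-external host lookup from a hostmap <param>.

    param_name holds the external host names and param_value the internal
    ones, both comma separated, as the comment in the topology file states.
    Assumption: each internal name maps to the first external name.
    """
    externals = [h.strip() for h in param_name.split(",") if h.strip()]
    internals = [h.strip() for h in param_value.split(",") if h.strip()]
    if not externals:
        raise ValueError("at least one external host name is required")
    return {internal: externals[0] for internal in internals}


def rewrite_host(url: str, hostmap: dict) -> str:
    """Replace an internal host name in a URL with its external counterpart."""
    parts = urlsplit(url)
    host = hostmap.get(parts.hostname, parts.hostname)
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))
```

For the sandbox entry, outbound_hostmap("localhost", "sandbox,sandbox.hortonworks.com") maps both internal names to localhost, so a URL such as http://sandbox.hortonworks.com:50075/... would be handed to the client with localhost as its host.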


<topology>
<gateway>

<provider>
<role>authentication</role>
<name>ShiroProvider</name>
<enabled>true</enabled>
<param>
<name>sessionTimeout</name>
<value>30</value>
</param>
<param>
<name>main.ldapRealm</name>
<value>org.apache.hadoop.gateway.shirorealm.KnoxLdapRealm</value>
</param>
<param>
<name>main.ldapRealm.userDnTemplate</name>
<value>uid={0},ou=people,dc=hadoop,dc=apache,dc=org</value>
</param>
<param>
<name>main.ldapRealm.contextFactory.url</name>
<value>ldap://ip-172-31-53-37.ec2.internal:33389</value>
"default.xml" [readonly] [noeol] 89L, 3011C





Figure 16. Default.xml 
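
In a Shiro-style LDAP realm, the userDnTemplate value carries a {0} placeholder that is replaced by the login name to form the distinguished name used for the LDAP bind. The Python sketch below shows only that substitution; the function name user_dn is hypothetical, and the template is the one from default.xml.

```python
def user_dn(template: str, username: str) -> str:
    """Expand a userDnTemplate by substituting the login name for {0}."""
    if "{0}" not in template:
        raise ValueError("template must contain the {0} placeholder")
    return template.replace("{0}", username)


# Template as configured in default.xml; "guest" is the demo user.
TEMPLATE = "uid={0},ou=people,dc=hadoop,dc=apache,dc=org"
GUEST_DN = user_dn(TEMPLATE, "guest")
```

For the demo user, GUEST_DN is uid=guest,ou=people,dc=hadoop,dc=apache,dc=org, which corresponds to the kind of entry defined in users.ldif.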
Figure 17. Knox-Ambari


3.6. Validating the Services that are Supported by Knox 

Validation can be done using cd /etc/knox/conf for gateway-site.xml; the topologies directory has to
be used for validating admin.xml and default.xml. Then execute the command

view users.ldif

and then

curl -i -v -k -u admin:admin-password -X GET 'https://ip-172-31-53-37.ec2.internal:8443/gateway/hdp23/oozie/v1/admin/build-version'

All the services behind Knox have to be validated using the curl command in this way.
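
When several services have to be validated, the per-service URLs can be generated instead of typed by hand. The Python sketch below is an illustration under stated assumptions: the function names and the SAMPLE_CHECKS table are hypothetical, the Oozie path is the one from the command above, and the WebHDFS LISTSTATUS path is added only as a second sample check.

```python
# Hypothetical sample checks: service name -> path below the topology URL.
SAMPLE_CHECKS = {
    "oozie": "oozie/v1/admin/build-version",
    "webhdfs": "webhdfs/v1/?op=LISTSTATUS",
}


def validation_urls(host: str, topology: str, checks: dict, port: int = 8443) -> dict:
    """Build one gateway URL per service to be validated."""
    base = f"https://{host}:{port}/gateway/{topology}"
    return {name: f"{base}/{path.lstrip('/')}" for name, path in checks.items()}


def curl_command(url: str, user: str, password: str) -> str:
    """Format the curl invocation for one validation URL."""
    return f"curl -i -k -u {user}:{password} -X GET '{url}'"
```

Calling validation_urls("ip-172-31-53-37.ec2.internal", "hdp23", SAMPLE_CHECKS) and passing each result to curl_command with the admin credentials produces one command per service, each of which should return an HTTP 200 response if the service is reachable through the gateway.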
4. CONCLUSION

This paper shows that security risks such as insufficient authentication, lack of privacy, lack of
integrity, and arbitrary code execution are all associated with Kerberos. Knox is therefore introduced in
this paper to overcome these security risks. Tools such as Ambari, Rsingh, Puppet, and Chef automate
work on clusters of 150 nodes or more. Clusters of 4000 to 6000 name nodes can be formed using these
tools, and 10000 name nodes can be formed using Puppet. Installation using Ambari is shown in detail
in this paper, and its working will be shown in future work.